I completed all the DataCamp exercises before proceeding to this one. I also looked up how to use R Markdown more productively, so hopefully this report will be clearer than the previous ones.

This week we look at crime in Boston, using the Boston dataset from the MASS package. Its variables are:
| Variables | Description |
|---|---|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
require("easypackages") # for loading and installing many packages at one time
packages_to_load <- c("broom", "dplyr", "MASS", "tidyverse", "corrplot", "ggplot2", "GGally", "caret", "devtools", "ggthemes", "scales", "plotly")
packages(packages_to_load, prompt = TRUE) # lock'n'load install/load
# Additionally:
# install_github("fawda123/ggord") # installed from GitHub for visualization
library(ggord)
# To empty the memory after the previous exercise:
# rm(list=ls())
# Load the data
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # set working directory to the source file
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
All variables are doubles except chas and rad, which are stored as integers (chas is a dummy variable). The data contains demographic information on Boston, including tax rates and other factors possibly linked to crime rates within the city.
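As a quick check of the storage types, the column classes can be listed directly (a minimal sketch, assuming MASS::Boston is loaded as above):

```r
library(MASS)  # provides the Boston dataset

# list the storage type of each column
types <- sapply(Boston, class)
table(types)              # most columns are doubles
types[c("chas", "rad")]   # chas and rad are stored as integers
```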
my_fn <- function(data, mapping, ...) {
  # scatterplot panel with a loess smoother (red) and a linear fit (blue)
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method = "loess", fill = "red", color = "red", ...) +
    geom_smooth(method = "lm", fill = "blue", color = "blue", ...)
  p
}
g <- ggpairs(Boston, columns = 1:14, lower = list(continuous = my_fn))
g
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The correlation plot gives a more focused view of the relationships, since they are hard to make out in the pairs plot with this many variables.
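To pull out the strongest relationships numerically, one can filter the correlation matrix directly (a minimal sketch; the 0.7 cutoff is an arbitrary choice of mine):

```r
library(MASS)

cor_matrix <- cor(Boston)
# keep only the upper triangle so each pair appears once
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
strong <- which(abs(cor_matrix) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, 1]],
           var2 = colnames(cor_matrix)[strong[, 2]],
           cor  = round(cor_matrix[strong], 2))
```

Among others, rad and tax are very strongly correlated (about 0.91).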
# Scaling the variables with mean and standard deviation with scale()
Boston_scaled <- as.data.frame(scale(Boston))
# create a quantile vector of crime and print it
bins <- quantile(Boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime' from the quantile bins
crime <- cut(Boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(Boston)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- Boston_scaled[ind,]
# create test set
test <- Boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
summary(train)
## zn indus chas
## Min. :-0.48724 Min. :-1.51549 Min. :-0.27233
## 1st Qu.:-0.48724 1st Qu.:-0.86683 1st Qu.:-0.27233
## Median :-0.48724 Median :-0.19558 Median :-0.27233
## Mean :-0.01734 Mean : 0.02999 Mean :-0.03844
## 3rd Qu.: 0.04872 3rd Qu.: 1.01499 3rd Qu.:-0.27233
## Max. : 3.58609 Max. : 2.42017 Max. : 3.66477
## nox rm age
## Min. :-1.46443 Min. :-3.87641 Min. :-2.33313
## 1st Qu.:-0.87761 1st Qu.:-0.57910 1st Qu.:-0.79577
## Median :-0.14407 Median :-0.11334 Median : 0.33305
## Mean : 0.01955 Mean :-0.02494 Mean : 0.01982
## 3rd Qu.: 0.61319 3rd Qu.: 0.45525 3rd Qu.: 0.90413
## Max. : 2.72965 Max. : 3.55153 Max. : 1.11639
## dis rad tax
## Min. :-1.26582 Min. :-0.98187 Min. :-1.31269
## 1st Qu.:-0.80637 1st Qu.:-0.63733 1st Qu.:-0.75495
## Median :-0.32661 Median :-0.52248 Median :-0.39895
## Mean :-0.03654 Mean : 0.03555 Mean : 0.04215
## 3rd Qu.: 0.51946 3rd Qu.: 1.65960 3rd Qu.: 1.52941
## Max. : 3.95660 Max. : 1.65960 Max. : 1.79642
## ptratio black lstat
## Min. :-2.70470 Min. :-3.90333 Min. :-1.5296
## 1st Qu.:-0.48756 1st Qu.: 0.19715 1st Qu.:-0.7640
## Median : 0.29768 Median : 0.38442 Median :-0.1664
## Mean : 0.02534 Mean :-0.01071 Mean : 0.0273
## 3rd Qu.: 0.80578 3rd Qu.: 0.43330 3rd Qu.: 0.6266
## Max. : 1.63721 Max. : 0.44062 Max. : 3.5453
## medv crime
## Min. :-1.90634 low : 95
## 1st Qu.:-0.63692 med_low :103
## Median :-0.15579 med_high:100
## Mean :-0.03298 high :106
## 3rd Qu.: 0.25195
## Max. : 2.98650
Now we have split the scaled data into train and test sets, with the new categorical variable “crime” based on the quantiles of the original numeric variable.
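The quantile-based binning can be illustrated on a toy vector (a sketch with made-up data, using the same cut() call as above):

```r
x <- 1:100
bins <- quantile(x)                        # 0%, 25%, 50%, 75%, 100% cut points
y <- cut(x, breaks = bins, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(y)                                   # each class gets 25% of the observations
```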
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2351485 0.2549505 0.2475248 0.2623762
##
## Group means:
## zn indus chas nox rm
## low 0.94440483 -0.8860431 -0.1065564 -0.8565321 0.3760637
## med_low -0.06533749 -0.3106752 -0.0812077 -0.5700430 -0.1289125
## med_high -0.38347752 0.2070187 0.1607520 0.3788041 0.1350744
## high -0.48724019 1.0149946 -0.1237592 1.0386948 -0.4342557
## age dis rad tax ptratio
## low -0.8735571 0.8276950 -0.6953589 -0.7285311 -0.36259909
## med_low -0.3256663 0.3411869 -0.5458997 -0.4568972 -0.07184181
## med_high 0.4146562 -0.3977027 -0.3927076 -0.2881689 -0.33328026
## high 0.7836908 -0.8373930 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.38675456 -0.73929892 0.444285615
## med_low 0.34981973 -0.11922776 0.000127677
## med_high 0.08575804 -0.03644602 0.227810152
## high -0.80826193 0.91685446 -0.738929755
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.138009125 0.74028583 -1.09696055
## indus -0.003847934 -0.32548096 0.08754348
## chas -0.067664075 -0.06180598 0.04520960
## nox 0.285937634 -0.82885041 -1.24525429
## rm -0.086150700 -0.05590585 -0.09501534
## age 0.316543734 -0.32356275 -0.11492945
## dis -0.140728653 -0.36444382 0.29952893
## rad 3.255014878 0.80632853 -0.25830319
## tax -0.067812124 0.08209102 0.81565185
## ptratio 0.118808332 0.09934037 -0.28106515
## black -0.178445563 0.01398437 0.14550745
## lstat 0.171614667 -0.19522951 0.48709027
## medv 0.141566523 -0.47310859 -0.09784005
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9494 0.0384 0.0122
g <- ggord(lda.fit, train$crime, ellipse_pro = 0.95, vec_ext = 2, alpha_el = 0.3, size = 2)
g + theme_tufte() + geom_rangeframe()
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 13 2 0
## med_low 1 17 5 0
## med_high 1 9 16 0
## high 0 0 1 20
In the high-crime quartile, almost all cases were predicted correctly (20 of 21, with one misclassified as med_high), but the other classes fared worse: low and med_high were often confused with med_low. Overall, LDA predicts the crime classes with reasonable accuracy, about 69% on this split; note that the exact numbers depend on the random train/test split.
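The overall accuracy can be read off the cross-tabulation by dividing the diagonal (correct predictions) by the total. A sketch using the table printed above (your numbers will differ, since the split is random):

```r
# confusion matrix copied from the printout above
tab <- matrix(c(17, 13,  2,  0,
                 1, 17,  5,  0,
                 1,  9, 16,  0,
                 0,  0,  1, 20),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # about 0.69
```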
# k-means clustering
BostonS <- scale(Boston)  # standardize the variables
# total within-cluster sum of squares for k = 1..15 (elbow method)
wss <- (nrow(BostonS) - 1) * sum(apply(BostonS, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(BostonS, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
km <- kmeans(BostonS, centers = 4)
# plot the Boston dataset with clusters
pairs(BostonS, col = km$cluster)
With four clusters, the pairs plot shows groupings similar to the LDA classes.
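Since kmeans() starts from random centers, the clustering varies between runs. A reproducible sketch (the set.seed value and nstart argument are my additions) that also cross-tabulates the clusters against the crime quartiles:

```r
library(MASS)

set.seed(123)                                    # reproducible initialization
BostonS <- scale(Boston)
km <- kmeans(BostonS, centers = 4, nstart = 25)  # try 25 starts, keep the best
km$size                                          # observations per cluster
# compare clusters to the crime quartiles of the original data
crime <- cut(Boston$crim, breaks = quantile(Boston$crim), include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(cluster = km$cluster, crime)
```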
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Now we can inspect the observations in an interactive 3D plot of the three linear discriminants.
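Note that predict() centers the data before projecting, so the manual matrix product above differs from predict(lda.fit)$x only by a constant shift in each discriminant. This can be checked on any dataset; a sketch using the built-in iris data:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
manual <- as.matrix(iris[, 1:4]) %*% fit$scaling  # uncentered projection
scores <- predict(fit)$x                          # centered LDA scores
# the difference is constant within each column, so its sd is ~0
apply(manual - scores, 2, sd)
```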